Abstract. This note comments on Ronald Gallant’s (2015) reflections on the
construction of Bayesian prior distributions from moment conditions. The main
conclusion is that the paper does not deliver a working principle that could
justify inference based on such priors.


1. INTRODUCTION 

The construction of prior distributions has always been a central aspect of
Bayesian analysis, arguably the central piece, since all aspects of Bayesian
inference are automatically derived from defining the model, the prior, the data
and the loss function (Berger, 1985). While the prior is a mathematical object,
there is no rigorous derivation of a given prior distribution from the available
information or lack thereof and, while “objective Bayes” constructions are
automated to some extent (Lhoste, 1923; Jeffreys, 1939; Broemeling and Broemeling,
2003; Berger et al., 2009), they are rejected by subjectivist Bayesians who argue
in favour of personalistic and non-reproducible prior selection (Kadane, 2011).
Defining priors via moment conditions can be traced at least back to Jaynes
(2003) and the notion of maximum entropy priors, even though the moment
conditions there only involve the parameters $\theta$ of the model. The paper
under discussion considers instead moment conditions as defined jointly on the
pair $(x, \theta)$ and proposes some necessary conditions for a prior distribution
to be compatible with those conditions. From a foundational perspective, a setting
where the joint distribution of the data and the parameter that drives this data
is a given is hard to fathom, because it implies there is no longer a fixed
parameter to infer about. In addition, the term “exogeneity” used in the paper
hints at a notion of the parameter being not truly a parameter, but rather
including latent variables and maybe random effects. It is hard to reconcile this
motivation with a computational one (also found in the paper) where complex
likelihoods would justify calling for a prior as a practical tool. The additional
reformulation of a (pseudo-)likelihood function defined by the method of moments
(Section 2) makes the focus of the paper difficult to specify and hence analyse.

Given the introduction through Fisher’s (1930) fiducial distribution, one may
wonder whether the author’s aim is to integrate fiducial constructs within the
Bayesian paradigm or at the very least to specify under which conditions this
can be achieved. As a long-time sceptic on the relevance of fiducial arguments,
I doubt one’s ability to produce an arbitrary distribution on an equally
arbitrary transform of the pair $(x, \theta)$, instead of a genuine
prior $\times$ likelihood construct. For instance, the discussion around the
various meanings of the t statistic

$$ t = \frac{\bar{x} - \theta}{s / \sqrt{n}} $$

does not imply that it can achieve a t posterior distribution with $n - 1$ degrees
of freedom jointly for all sample sizes $n$, and I doubt this can happen outside
exotic cases like Dirac masses on one of the terms. (Exchanging the randomness of
terms in a random variable as if it were a linear equation is a guaranteed way to
produce fiducial paradoxes and measure-theoretic difficulties.)

This set of comments on Gallant (2015) is organised as follows: in Section 
2, I analyse various and somewhat mutually exclusive aspects of the author’s 
approach, while in Section 3, I discuss some computational consequences and 
alternatives. 


2. DERIVING PRIORS FROM MOMENTS 
2.1 Moment default likelihood 

Gallant (2015) considers the distribution of a pivotal quantity like 

$$ Z = \sqrt{n}\, W(x, \theta)^{-1/2}\, m(x, \theta) $$

as induced by the hypothetical joint distribution on $(x, \theta)$, hence conversely
inducing constraints on this joint, as well as an associated conditional. (The
constraints may be such that the joint distribution does not exist.) However, this
perspective is abandoned a few lines below to define a moment likelihood

$$ p(x \mid \theta) = (2\pi)^{-M/2} \exp\left\{ -\frac{n}{2}\, m(x, \theta)^{\top} [W(x, \theta)]^{-1} m(x, \theta) \right\} $$

as a quasi-Gaussian pseudo-likelihood in the moment $m(x, \theta)$. This is only one
among many ways of defining a likelihood from moments, but it further removes
the symmetry in $x$ and $\theta$ induced by the original formulation. In addition,
one may wonder why a determinant like $\det\{W(x, \theta)\}^{-1/2}$ or at least a
normalising constant (obviously depending on $\theta$) does not appear in the
pseudo-likelihood, since this impacts the resulting posterior density.

A connected reference is Zellner’s (1997) Bayesian method of moments where,
given moment conditions on the parameters $\theta$ and $\sigma^2$,

$$ E[\theta \mid x_1, \ldots, x_n] = \bar{x}_n, \quad E[\sigma^2 \mid x_1, \ldots, x_n] = s_n^2, \quad \mathrm{var}(\theta \mid \sigma^2, x_1, \ldots, x_n) = \sigma^2 / n, $$

Zellner (1997) derives a maximum entropy posterior

$$ \theta \mid \sigma^2, x_1, \ldots, x_n \sim \mathcal{N}(\bar{x}_n, \sigma^2 / n), \quad \sigma^{-2} \mid x_1, \ldots, x_n \sim \mathcal{E}xp(s_n^2), $$

later shown to be incompatible with the corresponding predictive distribution,
besides producing an inconsistent estimator of $\sigma^2$ (Geisser, 1999).^

2.2 Measure-theoretic considerations 

“If one specifies a set of moment functions collected together into a vector
$m(x, \theta)$ of dimension $M$, regards $\theta$ as random and asserts that some
transformation $Z(x, \theta)$ has distribution $\psi$ then what is required to use
this information and then possibly a prior to make valid inference?” (p.4)

The central question in the paper is determining whether a set of moment
equations

$$ E[m(x_1, \ldots, x_n, \theta)] = 0 \qquad (1) $$

(where both the $x_i$'s and $\theta$ are a priori random) leads to a well-defined
pair of a likelihood function and a prior distribution compatible with those.
From a mathematical perspective, this seems to be a highly complex question as it
implies that the integral equation

$$ \int_{\Theta \times \mathcal{X}^n} m(x_1, \ldots, x_n, \theta)\, \pi(\theta)\, f(x_1 \mid \theta) \cdots f(x_n \mid \theta)\, d\theta\, dx_1 \cdots dx_n = 0 $$

must allow for a solution for all $n$'s.

Still from a purely mathematical perspective, the problem as stated in Section
3.3 of Gallant (2015) is puzzling: if the distribution of the transform
$Z = Z(X, \Lambda)$ is provided, what are the consequences on the joint
distribution of $(X, \Lambda)$? It is conceivable but rather unlikely that this
distribution $\psi$ will induce a single joint, that is, a single prior and a
single likelihood. It is much more likely that the distribution $\psi$ one
arbitrarily selects on $m(x, \theta)$ is incompatible with a joint distribution on
$(x, \theta)$. To wit, Fisher’s example of the t statistic and of its $t_{n-1}$
distribution.
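
As a hedged illustration of why a single joint is unlikely to be pinned down, consider a toy Gaussian setting (an assumption of this sketch, not an example taken from Gallant, 2015) in which the $x_i$'s are i.i.d. $\mathcal{N}(\theta, 1)$ given $\theta$, and $\theta \sim \pi$ for an arbitrary proper prior $\pi$:

```latex
\begin{align*}
Z(x, \theta) &= \sqrt{n}\,(\bar{x}_n - \theta), \\
Z \mid \theta &\sim \mathcal{N}(0, 1) \quad \text{for every } \theta, \\
P(Z \le z) &= \int_\Theta \Phi(z)\, \pi(\theta)\, d\theta = \Phi(z).
\end{align*}
```

In this case $\psi$ is the standard normal distribution whatever the prior, so asserting $\psi$ carries no information about $\pi$: the same distribution on $Z$ is compatible with infinitely many joints on $(x, \theta)$.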

“Typically $\mathcal{C}$ is coarse in the sense that it does not contain all the
Borel sets (...) The probability space cannot be used for Bayesian inference.” (p.8)

My understanding of that part of the paper is that defining a joint on
$m(x, \theta)$ is not always enough to deduce a (unique) posterior on $\theta$,
which is fine and correct, but definitely anticlimactic. This seems to be what
Gallant calls a “partial specification of the prior” (p.9). Hence, rather than
building the minimal Borel $\sigma$-algebra on $\mathcal{X}^n \times \Theta$
compatible with this joint on $m(x, \theta)$, I would suggest examining the range
of prior $\times$ likelihood pairs that agree with this partial property when
using the regular Borel $\sigma$-algebra.

The general solution found in Section 3.5 (“The Abstraction”) relies on the
assumption that $Z(\cdot, \theta)$ is a surjective function for all $\theta$'s and
on the axiom of choice, namely that an antecedent of the function can be selected
for each $z \in \mathcal{Z}$, that is, $T(z, \theta) = x^*$ such that
$Z(x^*, \theta) = z$. Under these assumptions, $Z$ and $T(Z, \theta)$ are in
one-to-one correspondence and hence can enjoy the same distribution modulo the
proper change of variable. The distribution over $X$ is then obtained by assuming
a uniform distribution over the orbit of $Z(x, \theta) = Z(x^*, \theta)$,
leading to

$$ p^*(x \mid \theta) = \psi(Z(x, \theta)) $$

as defined in Equation (17). There is no issue with this derivation but, as noted
previously, there is no compelling reason either to adopt the smallest
$\sigma$-algebra $\mathcal{C}^*$ to make the above a proper density in
$\mathcal{X}^n$. I see little appeal in using this new measure and further wonder
in which sense this defines a likelihood function, i.e., the product of $n$
densities of the $x_i$'s conditional on $\theta$. To me this is the central
issue, which remains unsolved by the paper.

^In essence, the prior changes for each sample size $n$. Geisser (1999) designates
the Maxent principle per se as the culprit for this incoherence.

2.3 Computational motivations 

“A common situation that requires consideration of the notions that follow is that
deriving the likelihood from a structural model is analytically intractable and one
cannot verify that the numerical approximations one would have to make to
circumvent the intractability are sufficiently accurate.” (p.7)

This computational perspective then is a completely different issue, namely
that defining a joint distribution by means of moment equations prevents regular
Bayesian inference when the likelihood function is intractable. This point of view
is much more exciting for two reasons. First, (i) there are alternatives
available, from approximate Bayesian computation (ABC) (Marin et al., 2011) to
INLA (Rue et al., 2008), to EP (Barthelmé and Chopin, 2014), to variational Bayes
(Jaakkola and Jordan, 2000). In particular, the moment equations strongly and even
insistently suggest that empirical likelihood techniques (Owen, 2001; Lazar, 2003)
could be well-suited to this setting. Second, (ii) it is no longer a mathematical
puzzle: there exists a joint distribution on $m(x, \theta)$, induced by one (or
many) joint distribution(s) on $(x, \theta)$. Hence, the question of finding
whether or not this item of information leads to a single proper prior on
$\theta$ becomes irrelevant. However, in the event one wants to rely on ABC, being
given the distribution of $m(x, \theta)$ seems to mean one can solely generate new
values of this transform while missing a natural distance between observations and
pseudo-observations, although the log-likelihood of $m(x^{\mathrm{obs}}, \theta)$
could itself be used as a distance.

As an aside, the author mentions marginal likelihood estimation by harmonic
means à la Newton and Raftery (1994), but I would like to point out that this is
usually a rather poor solution with potential for disaster, and that it requires
the likelihood function to be available in closed form. It is also unclear to me
why marginal likelihood is mentioned at this stage.
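
The instability of the harmonic mean estimator can be examined on a toy conjugate model where the exact marginal likelihood is known. The sketch below assumes i.i.d. $\mathcal{N}(\theta, 1)$ observations and a $\mathcal{N}(0, \tau^2)$ prior; the model, the seed, and all function names are illustrative choices and are not taken from the paper under discussion.

```python
import math
import random

def log_lik(x, theta):
    """Log-likelihood of i.i.d. N(theta, 1) observations."""
    return sum(-0.5 * math.log(2 * math.pi) - 0.5 * (xi - theta) ** 2 for xi in x)

def exact_log_marginal(x, tau2):
    """Closed-form log marginal likelihood under the prior theta ~ N(0, tau2)."""
    n, s1, s2 = len(x), sum(x), sum(xi * xi for xi in x)
    quad = s2 - tau2 * s1 ** 2 / (1.0 + n * tau2)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * math.log(1.0 + n * tau2) - 0.5 * quad)

def harmonic_mean_log_marginal(x, tau2, n_draws, rng):
    """Harmonic mean estimate based on exact posterior draws, on the log scale."""
    n = len(x)
    post_var = 1.0 / (n + 1.0 / tau2)
    post_mean = post_var * sum(x)
    neg_ll = [-log_lik(x, rng.gauss(post_mean, math.sqrt(post_var)))
              for _ in range(n_draws)]
    top = max(neg_ll)
    # Log of the average of 1 / L(theta_j), computed with a log-sum-exp shift.
    log_mean_inv = top + math.log(sum(math.exp(v - top) for v in neg_ll) / n_draws)
    return -log_mean_inv

rng = random.Random(1)
x = [rng.gauss(1.0, 1.0) for _ in range(20)]
exact = exact_log_marginal(x, tau2=100.0)
estimates = [harmonic_mean_log_marginal(x, 100.0, 2000, rng) for _ in range(5)]
```

Comparing `estimates` with `exact` over several replications gives a feel for the variability. In this toy model the average of inverse likelihoods has infinite variance under the posterior, because the prior is far more diffuse than the likelihood, so the estimate is driven by rare draws in the posterior tails.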

3. A FORM OF ABC? 

“These characteristics are (1) likelihood is not available; (2) prior information is
available; (3) a portion of the prior information is expressed in terms of functionals
of the model that cannot be converted into an analytic prior on model parameters;
(4) the model can be simulated. Our approach depends on an assumption that (5) an
adequate statistical model for the data are available.” R. Gallant and R. McCulloch
(2009)

As a final comment connected with the computational aspect of the current
paper, I would like to point out the connections of Gallant and McCulloch (2009)
with the ABC approach, to wit the above quote.
In Gallant and McCulloch (2009), the true (scientific) model parametrised by
$\theta$ is replaced with a (statistical) substitute that is available in closed
form and parametrised by $g(\theta)$. [This rests on Assumption 1, which states
that the intractable density is equal to a closed-form density.] This latter model
is over-parametrised when compared with the scientific model. Take, e.g., a
$\mathcal{N}(\theta, \theta^2)$ scientific model versus a
$\mathcal{N}(\mu, \sigma^2)$ statistical model. In addition, the prior information
is only available on the parameter $\theta$. However, this does not seem to matter
very much since (a) the Bayesian analysis is operated on $\theta$ only and (b) the
Metropolis approach adopted by the authors involves simulating a massive number of
pseudo-observations, given the current value of the parameter $\theta$ and the
scientific model, so that the transform $g(\theta)$ can be estimated by maximum
likelihood over the statistical model. The paper suggests using a secondary Markov
chain algorithm to find this MLE. The pseudo-model is then used in a primary MCMC
step.
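
The estimation of the transform $g(\theta)$ by simulation can be sketched as follows, reading the toy pair above as a $\mathcal{N}(\theta, \theta^2)$ scientific model and a $\mathcal{N}(\mu, \sigma^2)$ statistical model. This is a simplified illustration rather than the authors' algorithm: the Gaussian MLE is available in closed form here, so no secondary Markov chain is needed, and the function names and the pseudo-sample size are assumptions.

```python
import math
import random

def simulate_scientific(theta, size, rng):
    """Draw pseudo-observations from the scientific model N(theta, theta^2)."""
    return [rng.gauss(theta, abs(theta)) for _ in range(size)]

def normal_mle(sample):
    """Closed-form MLE (mu_hat, sigma2_hat) under the statistical model."""
    n = len(sample)
    mu = sum(sample) / n
    sigma2 = sum((s - mu) ** 2 for s in sample) / n
    return mu, sigma2

def estimate_g(theta, rng, size=100_000):
    """Estimate g(theta) by fitting the statistical model to a large pseudo-sample."""
    return normal_mle(simulate_scientific(theta, size, rng))

def surrogate_log_lik(x_obs, theta, rng, size=100_000):
    """Log-density of the observed data under the fitted statistical model,
    i.e. the quantity a primary Metropolis step would evaluate at theta."""
    mu, sigma2 = estimate_g(theta, rng, size)
    return sum(-0.5 * math.log(2 * math.pi * sigma2)
               - 0.5 * (x - mu) ** 2 / sigma2 for x in x_obs)

rng = random.Random(42)
mu_hat, sigma2_hat = estimate_g(2.0, rng)
```

With a large pseudo-sample, `estimate_g(2.0, rng)` should be close to $(2, 4)$, which is the restriction $g(\theta) = (\theta, \theta^2)$ implied by the scientific model in this toy case.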

Hence, the approach of Gallant and McCulloch (2009) is not truly an ABC
algorithm. In the same setting, ABC would indeed use one simulated dataset,
with the same size as the observed dataset, compute the MLEs for both and
compare them (as in Drovandi et al., 2011; Martin et al., 2014). This approach
is faster if less accurate when Assumption 1 (that the statistical model holds for
a restricted parametrisation) does not stand.
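
For contrast, the ABC variant described in this paragraph can be sketched as a plain rejection sampler: one pseudo-dataset per proposed $\theta$, with the MLEs under the statistical model serving as summary statistics. The uniform prior, the tolerance `eps`, the Euclidean distance, and the function names are all assumptions made for this illustration; the cited references may proceed differently.

```python
import math
import random

def simulate_scientific(theta, size, rng):
    """Draw a dataset from the scientific model N(theta, theta^2)."""
    return [rng.gauss(theta, abs(theta)) for _ in range(size)]

def normal_mle(sample):
    """MLE (mu_hat, sigma_hat) under the statistical model N(mu, sigma^2)."""
    n = len(sample)
    mu = sum(sample) / n
    sigma = math.sqrt(sum((s - mu) ** 2 for s in sample) / n)
    return mu, sigma

def abc_rejection(x_obs, n_draws, eps, rng):
    """Keep the proposed thetas whose pseudo-data MLE is within eps of the observed MLE."""
    target = normal_mle(x_obs)
    accepted = []
    for _ in range(n_draws):
        theta = rng.uniform(0.1, 10.0)  # assumed prior, for illustration only
        pseudo = simulate_scientific(theta, len(x_obs), rng)
        if math.dist(normal_mle(pseudo), target) <= eps:
            accepted.append(theta)
    return accepted

rng = random.Random(7)
x_obs = simulate_scientific(2.0, 100, rng)  # synthetic "observed" data
sample = abc_rejection(x_obs, 10_000, 0.4, rng)
```

Each proposal costs a single simulation of the same size as the data, instead of the massive pseudo-sample needed to estimate $g(\theta)$, which is the sense in which this route is faster; the price is the extra approximation induced by the tolerance and by the simulation noise in the summaries.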

4. CONCLUSION 

One overall question about this paper is the validation of the outcome.
As noted in Fraser (2011), Bayesian posterior distributions are not naturally
endowed with an epistemic validity. The same questioning obviously applies to
entities defined outside the Bayesian paradigm, the present one included, in that
producing a posterior or pseudo-posterior distribution on the parameter offers no
guarantee per se about the efficiency of the inference it produces. Using
asymptotically convergent approximations to the likelihood function does not
always lead to consistent Bayesian approximations (Marin et al., 2014), and the
procedures proposed here thus require further validation.

Another global question that remains open is the validation of the input
outside of the Bayesian paradigm. The production of equation (1) has to occur as a
byproduct of defining a joint probability model on the space
$\mathcal{X} \times \Theta$, which seems to logically exclude both non-Bayesian
perspectives and ex nihilo occurrences of moment conditions. The only statistical
example worked out in the paper, namely habit persistence asset pricing, starts
with a given prior distribution (31), which makes this example irrelevant for the
stated goal of checking whether or not “the assertion of a distribution for moment
functions either partially or completely specifies the prior”.

Despite these difficulties in grasping the paper's postulate, I would like
to conclude with a more positive perspective, namely that the problem of
partially defining models and priors via moment conditions is of considerable
interest in an era of Big Data, small worlds (Savage, 1954), and limited
information. Acknowledging that inference, and in particular Bayesian inference,
cannot always handle big worlds (see, e.g., the paradox exposed in Robins and
Wasserman, 2000) and constructing coherent and efficient tools for restricted
inference about some aspects of the model are very current questions that call
for being addressed in full generality.